Last updated on 2017-03-08

Lec10 - Wed 3/8: Splines + LOESS Smoothers

What are splines?

Drawing

Splines: Calculus and Linear Algebra at work.

LOESS Smoother

The red curve is the LOESS smoothed curve:

Drawing
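
A minimal sketch of fitting a LOESS smoother with base R's loess() (toy simulated data here; the figure above uses the lecture's own dataset):

```r
# Fit a LOESS smoother to toy simulated data using base R's loess().
set.seed(76)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)    # signal + noise
fit <- loess(y ~ x, span = 0.75)      # span controls smoothness
y_hat <- predict(fit)                 # the smoothed (red) curve
plot(x, y)
lines(x, y_hat, col = "red", lwd = 2)
```

Smaller span values track the points more closely; larger values give a smoother curve.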

Splines vs Loess

Lec08 - Fri 3/3: Pseudocode for Cross-Validation

intro_splines.R Solutions

Copy this to the end of intro_splines.R from the previous lecture:

# SOLUTIONS:
# 1. Note: Anything defined in the ggplot() command trickles down to all the sub
# geoms. If you want to use data/aes() specific to only one particular geom,
# you define it inside:
ggplot(data=NULL, aes(x=x)) +
  # What we're given - the obs y:
  geom_point(data=real_life_augmented, aes(y=y)) +
  # What we won't typically know, but here we pretend to - the f(x) i.e. the signal:
  geom_line(data=simulated, aes(y=f), col="red", size=1) +
  # Our guess at f(x), i.e. f_hat(x) - our fitted/predicted values:
  geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1)

# 2. Let's toy around with df. In particular df=2, 10, 50, 99
real_life_augmented <- smooth.spline(real_life$x, real_life$y, df=50) %>%
  augment() %>%
  tbl_df()
ggplot(data=NULL, aes(x=x)) +
  geom_point(data=real_life_augmented, aes(y=y)) +
  geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1) +
  geom_line(data=simulated, aes(y=f), col="red", size=1)

What does df do?
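
One way to see what df does: refit smooth.spline() over a range of df values and watch the training MSE (a base-R sketch; the toy data below stands in for real_life from the lecture code):

```r
# Higher df = wigglier fit = lower *training* MSE (but risk of overfitting).
set.seed(76)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)
for (df in c(2, 10, 50)) {
  fit <- smooth.spline(x, y, df = df)
  mse <- mean((y - predict(fit, x)$y)^2)
  cat("df =", df, " training MSE =", round(mse, 3), "\n")
}
```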

Lec07 - Wed 3/1: Regression Part III

Recall

Solutions

Add the following to Chapter3_Lab.Rmd from last lecture (better viewed in the HTML document):

# Load packages:
library(tidyverse)
library(MASS)
library(ISLR)
library(broom)

# Fit model. Boston would be the training data in a Kaggle competition. 
# i.e. we have the outcome variable median house value
model_SL <- lm(medv~lstat, data=Boston)

# new_values would be the test data in a Kaggle competition
# i.e. we don't have the outcome variable
new_values <- data_frame(lstat=c(5,10,15))

# This is the MSE using the Boston training data:
augment(model_SL) %>% 
  summarise(MSE=mean((medv-.fitted)^2))
# Or equivalently
augment(model_SL) %>% 
  summarise(MSE=mean((.resid)^2))

# This value does not exist, b/c we don't have the real y, so we can't compute
# the residuals:
augment(model_SL, newdata = new_values) %>% 
  summarise(MSE=mean((.resid)^2))

Full Model

# Fit model 
model_full <- lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data=Boston)
# See coefficients
tidy(model_full)

# This is the MSE using the Boston training data:
# Or equivalently
augment(model_full) %>% 
  summarise(MSE=mean((.resid)^2))

Polynomial Model

# Fit model 
model_polynomial <- 
  lm(medv~
       crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+
       I(crim^2)+I(zn^2)+I(indus^2)+I(chas^2)+I(nox^2)+I(rm^2)+I(age^2)+I(dis^2)+I(rad^2)+I(tax^2)+I(ptratio^2)+I(black^2)+I(lstat^2), 
     data=Boston)
# See coefficients
tidy(model_polynomial)

# This is the MSE using the Boston training data:
# Or equivalently
augment(model_polynomial) %>% 
  summarise(MSE=mean((.resid)^2))

Lec06 - Mon 2/27: Regression Part II

Regression in R

Today we do the following with a simple linear regression in R:

  1. Fit the model \(\widehat{f}\)
  2. Explore the fit
  3. Use the fitted model to make predictions
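
The three steps, sketched with base R on the built-in cars dataset (the lecture does this with broom's tidy()/augment()):

```r
# 1. Fit the model f_hat:
model <- lm(dist ~ speed, data = cars)
# 2. Explore the fit:
coef(model)               # intercept and slope
summary(model)$r.squared
# 3. Use the fitted model to make predictions at new values of speed:
predict(model, newdata = data.frame(speed = c(10, 15, 20)))
```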

broom Package

Lec05 - Fri 2/24: Regression

Names

Compare:

  • Col Archibald Gracie IV
  • Patrick O'Keefe

Example 1 Regression for Explanation

From biology/medicine. Let

  • \(Y\) = Blood-pressure
  • \(X\) = Dosage of blood-pressure medicine (in ml)
  • \(Y = \beta_0 + \beta_1 X\)
  • Interpretation of \(\beta_1\): For each additional ml increase in dosage, there is on average an (associated) change of \(\beta_1\) units in blood-pressure.

Example 2 Regression for Explanation

From economics/sociology. Let

  • \(Y\) = Annual income
  • \(X\) = # of years of schooling
  • \(Y = \beta_0 + \beta_1 X\)
  • Interpretation of \(\beta_1\): For each additional year of schooling, there is on average an (associated) change of \(\beta_1\) dollars in annual income.

Lec04 - Wed 2/22: Intro to Cross-Validation

Validation Set

Fig 5.1 p.177:

Drawing

LOOCV

Fig 5.3 p.179: Leave-One-Out Cross-Validation

Drawing

k-Fold CV

Fig 5.5 p.181: \(k=5\) Fold CV

Drawing
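
A base-R sketch of \(k=5\) fold CV for a simple regression, using MSE as the score (built-in cars data; the figure's setup may differ):

```r
set.seed(76)
n <- nrow(cars)
fold <- sample(rep(1:5, length.out = n))   # randomly assign each row to a fold
cv_mse <- sapply(1:5, function(k) {
  fit   <- lm(dist ~ speed, data = cars[fold != k, ])   # train on 4 folds
  y_hat <- predict(fit, newdata = cars[fold == k, ])    # predict held-out fold
  mean((cars$dist[fold == k] - y_hat)^2)                # fold-k test MSE
})
mean(cv_mse)   # the 5-fold CV estimate of the true MSE
```

LOOCV is the special case \(k = n\): every fold holds out exactly one row.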

Course for Next Few Lectures

Formalize what we mean by:

  • Model: Relates predictors to outcomes. \(Y = f(\vec{X})\). Chapter 2, Section 2.1.3
  • Score: We want to minimize the mean squared error. Chapter 2.2

Example: Data

Fig 2.2 p.17:

Drawing

Example: Model

Fig 2.2 p.17:

Drawing

Lec03 - Mon 2/20: Resampling for Prediction

Score

  • We submitted the following model to Kaggle: Predict Survived <- ifelse(Sex == "female", 1, 0)
  • We got a score (in this case "Categorization Accuracy" AKA proportion correct) of 0.76555.

\[Score = \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i)\] where

  • \(I(y_i = \widehat{y}_i) = 1\) if \(y_i = \widehat{y}_i\)
  • \(I(y_i = \widehat{y}_i) = 0\) if \(y_i \neq \widehat{y}_i\)
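
The score formula in code (toy vectors; in R, mean() of a logical comparison computes the proportion of matches):

```r
# Categorization accuracy: proportion of i where y_i equals y_hat_i.
y     <- c(0, 1, 1, 0, 1)   # true outcomes (toy values)
y_hat <- c(0, 1, 0, 0, 1)   # our predictions
score <- mean(y == y_hat)
score                       # 4 of 5 correct: 0.8
```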

Titanic Data

  • Kaggle split the original complete passenger data into
    • train data (891 rows): used by you to train model
    • test data (418 rows): used by Kaggle to evaluate/score/validate model
  • However, the Evaluation page indicates they further randomly split test into
    • 50% (209 rows): used by Kaggle to compute your leaderboard score
    • 50% (209 rows): used by Kaggle only at the end to determine the ultimate winner
  • This was done to prevent "overfitting" to the leaderboard: i.e. trying to game the leaderboard

Question of the Day

  • Kaggle limits you to 10 entries per day i.e. 10 score computations
  • What if we want to train models and compute scores for ourselves more often?
  • However, we need true outcomes \(y_i\) to compute the score, which test doesn't have

Solution

  • Create pseudo-train and pseudo-test sets via resampling from train
  • Compute pseudo-scores by applying your model to pseudo-test

Resampling

Drawing

Example

Create disjoint pseudo_train and pseudo_test data sets using dplyr::anti_join

library(tidyverse)
# You may need to change your directory path:
train <- readr::read_csv("assets/Titanic/train.csv")

pseudo_train <- train %>% 
  sample_frac(0.8)
pseudo_test <- train %>% 
  anti_join(pseudo_train, by="PassengerId")

See RStudio Menu Bar -> Help -> Cheatsheets -> Data Manipulation.

Resampling

Compute your pseudo-score

pseudo_test %>% 
  # Create new column of predictions:
  mutate(Survived_predicted = ifelse(Sex == "female", 1, 0)) %>% 
  # Compute score:
  summarize(Score = mean(Survived == Survived_predicted))
## # A tibble: 1 × 1
##       Score
##       <dbl>
## 1 0.7865169

Compare this to Kaggle score of 0.76555. Why are they different?

Resampling

  • They are different b/c of sampling variability. Both at sampling and resampling stages!
  • Let's repeat the above 10 times and get 10 pseudo-scores:
##  [1] 0.764 0.820 0.787 0.809 0.770 0.781 0.837 0.747 0.781 0.719

In this case, we have variability due to the resampling. The average of the 10 scores is 0.781
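
A base-R sketch of that repetition (the toy data frame below stands in for the Titanic train set; the lecture uses dplyr's sample_frac() and anti_join()):

```r
set.seed(76)
toy <- data.frame(
  PassengerId = 1:200,
  Sex         = sample(c("male", "female"), 200, replace = TRUE),
  Survived    = rbinom(200, size = 1, prob = 0.4)
)
pseudo_scores <- replicate(10, {
  train_idx   <- sample(nrow(toy), size = 0.8 * nrow(toy))  # 80% pseudo-train
  pseudo_test <- toy[-train_idx, ]                          # disjoint 20%
  # Apply the "women survive" model and score it:
  mean(pseudo_test$Survived == ifelse(pseudo_test$Sex == "female", 1, 0))
})
round(pseudo_scores, 3)   # 10 pseudo-scores, each a bit different
mean(pseudo_scores)       # their average
```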

Lec02 - Wed 2/15: Toolbox

Toolbox

  1. R: tidyverse and the pipe operator %>%
  2. DataCamp
  3. Kaggle

Tool 1: tidyverse

tidyverse is a set of packages that work in harmony because they share common data representations and API design.

install.packages("tidyverse")
library(tidyverse)

Read more on RStudio's Blog.

tidyverse Core Packages

These get loaded when you run library(tidyverse)

  • ggplot2 for data visualisation.
  • dplyr for data manipulation.
  • tidyr for data tidying.
  • readr for data import.
  • purrr for functional programming.
  • tibble for tibbles, a modern re-imagining of data frames.

tidyverse Non-Core Packages

These get installed with tidyverse

  • hms for times.
  • stringr for strings.
  • lubridate for date/times.
  • forcats for factors.

tidyverse Data Import Packages

These get installed with tidyverse

  • DBI for databases.
  • haven for SPSS, SAS and Stata files.
  • httr for web APIs.
  • jsonlite for JSON.
  • readxl for .xls and .xlsx files.
  • rvest for web scraping.
  • xml2 for XML.

tidyverse Modelling Packages

These get installed with tidyverse

  • modelr for simple modelling within a pipeline
  • broom for turning models into tidy data

Tool 2: DataCamp

  • Browser-based interactive tool for learning R/python
  • Interface is like RStudio: wrapper around R console
  • You all have access to ALL courses for free for the semester!

DataCamp

Tool 3: Kaggle

Titanic Data

For \(i=1,\ldots,n\) passengers

  • ID variable: PassengerId
  • Outcome variable \(y_i\): Survived (binary)
  • 10 Predictors/covariates \(\vec{X}_i\) i.e. information about the passenger: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.

Models for Predicting Survival:

Naïve models: Use no information about the passenger, i.e. no covariates

  1. Everyone dies: Predict
    • Survived == 0
  2. Everyone lives: Predict
    • Survived == 1
  3. Flip a weighted coin. Predict
    • Survived == 1 with probability \(p\)
    • Survived == 0 with probability \(1-p\)
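
The three naive models in code (toy n; p = 0.5 is an arbitrary choice of coin weight here):

```r
set.seed(76)
n <- 10                                      # number of test passengers (toy)
pred_all_die  <- rep(0, n)                   # 1. everyone dies
pred_all_live <- rep(1, n)                   # 2. everyone lives
p <- 0.5
pred_coin <- rbinom(n, size = 1, prob = p)   # 3. flip a weighted coin
```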

Models for Predicting Survival:

  • Instead, let's use some information about the passengers: Sex?
  • Our model: Predict:
    • Survived == 1 if Sex == "female"
    • Survived == 0 if Sex == "male"

Outline of Competition

Chalk talk i.e. see your lecture notes.

Titanic Data

Load CSV's into R:

library(tidyverse)
gender_submission <- readr::read_csv("assets/Titanic/gender_submission.csv")
test <- readr::read_csv("assets/Titanic/test.csv")
train <- readr::read_csv("assets/Titanic/train.csv")
  • You may need to change your directory path.
  • readr:: just indicates the function is from the readr package. Use for disambiguation.

Training Set

The binary outcome variable Survived is included.

glimpse(train)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...

Test Set

Survived is now NOT included. There are 418 rows (passengers) you need to predict.

glimpse(test)
## Observations: 418
## Variables: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2,...
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Nee...
## $ Sex         <chr> "male", "female", "male", "male", "female", "male"...
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18...
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0,...
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298",...
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7...
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B...
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", ...

Submission Example

This is what you submit: 418 rows with

  • the ID variable PassengerId
  • your predicted outcome variable Survived
glimpse(gender_submission)
## Observations: 418
## Variables: 2
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Survived    <int> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,...

Submission Example

You can write to CSV via:

gender_submission %>% readr::write_csv("assets/Titanic/submission.csv")

Submission Example

After submitting to Kaggle, you can see your ranking.

Lec01 - Mon 2/13: Introduction

What is statistical/machine learning?

Drawing

What is statistical/machine learning?

  • In Feb 2016, I agonized over the title of the course: "Statistical Learning" vs "Machine Learning"
  • In Summer 2016: Ben Baumer at Smith posed:
  • "Instead of obsessing over Venn diagrams of what topics are within the domains of which disciplines, I ask instead…"
  • "… What if we blew up math, stats, CS, and all their legacies and started over? What would this field look like?"
  • New moniker: Machine learning… taught from a statistician's perspective

Definitions

  • Arthur Samuel (1959): Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.
  • Me: Prediction

Examples:

Background

  • In the past, lots and lots of mathematical science pre-requisites: algorithms, probability, linear algebra, etc.
  • Brought to you from folks at Stanford in 2001:
    Drawing

Uncharted Territory

  • New model: flatter pre-requisite structure. That means we'll have a few digressions for probability, linear algebra, etc.
  • Brought to you from folks at Stanford (again) in 2013:
    Drawing

Uncharted Territory

Toolbox

Drawing
  • R vs Python. A hammer vs a screwdriver
  • Argument for R: Wickham's Tidy tools manifesto
    • Reuse existing data structures: Consistency across packages e.g. ggplot2, dplyr, modelr
    • Design for humans: Bottleneck in most data analysis is thinking time, not computing time. i.e. avoid premature optimization

Cornerstone of Course

Final Project: Kaggle (rhymes with haggle) competition.

Sampling

Drawing

Resampling

Drawing